Guarding against Spurious Discoveries in High Dimensions
Authors
Abstract
Many data-mining and statistical machine learning algorithms have been developed to select a subset of covariates to associate with a response variable. Spurious discoveries can easily arise in high-dimensional data analysis because of the enormous number of possible selections. How can we know statistically that our discoveries are better than those found by chance? In this paper, we define a measure of goodness of spurious fit, which shows how well a response variable can be fitted by an optimally selected subset of covariates under the null model, and propose a simple and effective LAMM algorithm to compute it. The measure coincides with the maximum spurious correlation for linear models and can be regarded as a generalized maximum spurious correlation. We derive the asymptotic distribution of this goodness of spurious fit for generalized linear models and L1-regression. The asymptotic distribution depends on the sample size, the ambient dimension, the number of variables used in the fit, and the covariance information. It can be consistently estimated by the multiplier bootstrap and used as a benchmark to guard against spurious discoveries. It can also be applied to model selection by considering only candidate models whose goodness of fit is better than that of spurious fits. The theory and method are convincingly illustrated by simulated examples and an application to the binary outcomes from the German Neuroblastoma Trials.
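As a rough illustration of the idea of a chance benchmark, the sketch below simulates the maximum spurious correlation in the simplest setting: a linear model in which a single covariate is selected. It is not the paper's LAMM algorithm or its multiplier bootstrap. It is a plain Monte Carlo simulation that assumes i.i.d. standard normal covariates and a response independent of them, and the function names `max_spurious_correlation` and `null_quantile` are hypothetical.

```python
import numpy as np


def max_spurious_correlation(X, y):
    """Largest absolute sample correlation between y and any single column of X."""
    Xc = X - X.mean(axis=0)          # center each covariate
    yc = y - y.mean()                # center the response
    corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return float(np.max(np.abs(corr)))


def null_quantile(n, p, level=0.95, n_sim=200, seed=0):
    """Monte Carlo quantile of the maximum spurious correlation under the null.

    Assumption for illustration only: X has i.i.d. N(0, 1) entries and
    y is drawn independently of X, so every observed correlation is spurious.
    """
    rng = np.random.default_rng(seed)
    stats = [
        max_spurious_correlation(
            rng.standard_normal((n, p)), rng.standard_normal(n)
        )
        for _ in range(n_sim)
    ]
    return float(np.quantile(stats, level))
```

Calling `null_quantile(n=50, p=10)` and `null_quantile(n=50, p=1000)` shows the benchmark rising with the ambient dimension: an observed correlation that looks convincing with ten covariates may fall below the level attainable by chance when a thousand are searched.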
Similar resources
VC-Dimension of Visibility on Terrains
A guarding problem can naturally be modeled as a set system (U, S) in which the universe U of elements is the set of points we need to guard and our collection S of sets contains, for each potential guard g, the set of points from U seen by g. We prove bounds on the maximum VC-dimension of set systems associated with guarding both 1.5D terrains (monotone chains) and 2.5D terrains (polygonal ter...
Evaluation of verbal evidence on the prohibition of Human Cloning
Human cloning or human reproduction through nonsexual transplantation is an interdisciplinary issue that has been discussed by researchers from different dimensions, including theological dimension and perspective of science, religion, and health. Since some scientific discoveries, such as human simulation, are related to human beliefs, several religious scholars have shown special sensitivity ...
Improved Approximation for Guarding Simple Galleries from the Perimeter
We provide an O(log log OPT)-approximation algorithm for the problem of guarding a simple polygon with guards on the perimeter. We first design a polynomial-time algorithm for building ε-nets of size O((1/ε) log log (1/ε)) for the instances of Hitting Set associated with our guarding problem. We then apply the technique of Brönnimann and Goodrich to build an approximation algorithm from this ε-n...
Fuzzy Model of Human’s Performance for Guarding a Territory in an Air Combat
This paper proposes a new method for a three dimensional fuzzy model of pilot's performance for guarding a territory with a short-distance between two aircraft in an air combat task with a gun. A third-order nonlinear point mass vehicle model is considered for an aircraft's flight dynamics. The desired value of the velocity, the flight path and the heading angles are obtained from some derived ...